Objectives:

https://blogs.bmj.com/bmj/2020/07/03/covid-19-has-highlighted-the-inadequate-and-unequal-access-to-high-quality-green-spaces/ https://www.weforum.org/agenda/2020/08/parks-green-spaces-mental-health-access-equality/

Scale of analysis: MSOA (2 of the 3 datasets at this granularity) Geographic scope: England (case data at MSOA level available for England only)

1. Data sources

1.1. Access to private green space (gardens)

The ONS (Office for National Statistics) provides data on access to private green space (i.e. access to gardens) for each MSOA in Great Britain. Here I am using the most recent April 2020 edition of the data. I quickly, manually edited the ONS excel file to make it easier use the read_excel function for data import. Given it is unlikely that the ONS data will be updated during the course of this analysis, it was preferable to go for the quicker manual process than investing time in a re-producible programmatic approach.

private_green_space <- read_excel("./Data/osprivateoutdoorspacereferencetables.xlsx", sheet = "MSOA gardens")
private_green_space %>% 
  sample_n_from_df(10)

1.2. Access to public green space (parks)

The ONS (Office for National Statistics) also provides data on access to public green space (i.e. access to parks and playing fields) for each Lower Super Output Area (LSOA) in Great Britain. In this case no manual editing of the ONS excel file was required.

It should be noted that this access to public green space data is provided at a finer geographic resolution than the other datasets used in this analysis, which are provided at MSOA granularity. Each MSOA is broken down in a number of LSOAs, so later on it will be straight forward (with the help of a lookup table) to aggregate this data to enable comparision with the other data sources.

public_green_space <-  read_excel("./Data/ospublicgreenspacereferencetables.xlsx", sheet = "LSOA Parks and Playing Fields")
public_green_space %>% 
  sample_n_from_df(10)

1.3. Covid cases

data.gov.uk provides data on the number of Covid-19 cases (as confirmed by positive tests) occurring in each English MSOA. This data is provided in the form of weekly case numbers. A value of -99 is used if in a given week, zero, one or two cases are reported in a MSOA. This is presumably to reduce the risk of individual cases reported within the data being identified.
Below I import the data and check how NA values are used within the dataset, as no details are provided in the limited documentation provided by data.gov.uk. No NA are present for variables reporting the number of cases occurring. This means I can replace the -99 values with NA, making it easier to work with the case data in the remainder of this notebook.

cases <- read_csv("./Data/MSOAs_latest.csv")

# Check how NA values used in the case data
na_count <- cases %>% 
  map(~ sum(is.na(.)))

# reimport case data, replacing -99 in case columns with NA and defining column types
# (where readr initial guesses have been unsuccessful)

cases <- read_csv("./Data/MSOAs_latest.csv", 
                  na = c(-99, "NA"),
                  col_types = cols(wk_05 = "d",
                                   wk_06 = "d",
                                   wk_07 = "d",
                                   wk_08 = "d",
                                   wk_09 = "d")
                  )
cases %>% 
  sample_n_from_df(10)

The data.gov.uk case data records cases across multiple columns. Each of the columns corresponds to a numbered week (starting from week 5 in 2020). So, below I transform the case data so it is in the tidy format expected by tidyverse tools (e.g. ggplot2).

# select all column names for variables reporting case data
case_colnames <- colnames(cases)[str_detect(colnames(cases),"^wk_")]

# find the first and last columns containing weekly cases
first_case_col <- case_colnames[1]
last_case_col <- case_colnames[length(case_colnames)]

# Tidy data and process dates
cases_tidy <- cases %>% 
  pivot_longer(cols = all_of(first_case_col):all_of(last_case_col), 
               names_to = "week", values_to = "cases") %>%
  
  # convert week numbers to w/c dates
  mutate(week = as.integer(str_sub(week, 4, 5))) %>%
  mutate(week_commencing = lubridate::ymd( "2019-12-30" ) + lubridate::weeks(week - 1))

cases_tidy

1.4. Geographic lookup tables

As noted above, some geographic aggregation of the access to public green space data will be required. Specifically, aggregation from the LSOA to MSOA level. A look-up table for this mapping from LSOAs to MSOAs is available for the ONS and is imported below.

## Parsed with column specification:
## cols(
##   OA11CD = col_character(),
##   OAC11CD = col_character(),
##   OAC11NM = col_character(),
##   LSOA11CD = col_character(),
##   LSOA11NM = col_character(),
##   SOAC11CD = col_character(),
##   SOAC11NM = col_character(),
##   MSOA11CD = col_character(),
##   MSOA11NM = col_character(),
##   LAD17CD = col_character(),
##   LAD17NM = col_character(),
##   LACCD = col_character(),
##   LACNM = col_character(),
##   RGN11CD = col_character(),
##   RGN11NM = col_character(),
##   CTRY11CD = col_character(),
##   CTRY11NM = col_character(),
##   FID = col_double()
## )

2. Identifying variables of interest

2.1. Private green space

two sub groups (Flats and Houses) within the data, at this stage interested in both (also provided)

priv_gs_focus <- private_green_space %>% 
  select(`MSOA code`, `MSOA name`, `T: Address count`, 
         `T: Percentage of adresses with private outdoor space`,
         `T: Average size of private outdoor space (m2)`
         ) %>% 
  rename(address_count = `T: Address count`,
         perc_garden = `T: Percentage of adresses with private outdoor space`,
         ave_garden_size = `T: Average size of private outdoor space (m2)`)

There 15 NA values present across all columns in the priv_gs_focus dataframe. Given the number of NAs is small relative to the size of the dataframe (8483, 5), and this is an exploratory data analysis I’ll leave NAs to be handled/removed by the plotting functions used below.

How to plot boxplot and histogram and align

format_hist_box_pair_plot <- function(x_lims){
  list(scale_x_continuous(limits= x_lims), 
       theme_minimal()
       )
}

hist_box_pair_plot <- function(df, x_var, x_lims, x_label, y_label,
                               colour = "grey40", ...){
  
  histogram <- ggplot(data = df,
              mapping = aes_string(x = x_var))
  
  histogram <- histogram + geom_histogram(fill = colour, binwidth = 0.01) +
    
    scale_x_continuous(labels = NULL) +
    labs(x = NULL, y = y_label) +
    format_hist_box_pair_plot(x_lims)
    
  box_plot <- ggplot(data = df,
              mapping = aes_string(x = x_var))
  
  box_plot <- box_plot + geom_boxplot(colour = colour) +

    scale_y_continuous(labels = NULL) +
    labs(x = x_label) +
    format_hist_box_pair_plot(x_lims)
  
  #combine and align the histogram and box plot
  cowplot::plot_grid(histogram, box_plot,
                     ncol = 1, rel_heights = c(3, 1),
                     align = 'v', axis = 'lr')
}

hist_box_pair_plot(priv_gs_focus, "perc_garden",
                   x_label = "Percentage of adresses with private outdoor space",
                   y_label = "number of MSOAs",
                   x_lims = c(0, 1)
                   ) 
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
## Warning: Removed 3 rows containing non-finite values (stat_boxplot).

https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html

var_summary <- function(df, var, na.rm = TRUE) {
 
  summary <- df %>% 
    summarise(min = min({{var}}, na.rm = na.rm),
              q1 = quantile({{var}}, probs = 0.25, na.rm = na.rm),
              median = median({{var}}, na.rm = na.rm),
              q3 = quantile({{var}}, probs = 0.75, na.rm = na.rm),
              max = max({{var}}, na.rm = na.rm),
              mean = mean({{var}}, na.rm = na.rm),
              sd = sd({{var}}, na.rm = na.rm)
              ) 
  
  summary[1, "variable"] <- toString(deparse(substitute(var)))
  summary %>% 
    relocate(variable, .before = min)
  
}

var_summary(priv_gs_focus, perc_garden)
p <- ggplot(data = subset(priv_gs_focus, subset = ave_garden_size < 2000), 
            mapping = aes(x = ave_garden_size)
            )

p + geom_histogram(bins = 100)

p <- ggplot(data = priv_gs_focus,
            mapping = aes(x = log10(ave_garden_size))
            )

p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).

2.2. Public green space

2.3. Covid-19 cases

3. Analysing the relationships between access to green space and Covid-19 cases